Introduction to Solving Biological Problems Using R - Day 1

Mark Dunning, Suraj Menon, and Aiora Zabala. Original material by Robert Stojnić, Laurent Gatto, Rob Foy John Davey, Dávid Molnár and Ian Roberts

Last modified: 09 Sep 2015

Course Aims

Day 1 Schedule

  1. Introduction to R and its environment
  2. Data Structures
  3. Data Analysis Example
  4. Plotting in R

1. Introduction to R and its environment

What’s R?

The R-project page

http://www.r-project.org/

R screenshot

R in the New York Times

http://goo.gl/pww4ZO

R new york time

R in Nature

R in nature

R plotting capabilities

http://spatial.ly/2012/02/great-maps-ggplot2/

R cycle journeys

R plotting capabilities

https://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919 R facebook

Who uses R? Not just academics!

http://www.revolutionanalytics.com/companies-using-r

Various platforms supported

Getting started

Launching R Using RStudio

To launch RStudio, find the RStudio icon in the menu bar on the left of the screen and click

RStudio screenshot

The Working Directory (wd)

setwd("R_course/Day_1_scripts")

Basic concepts in R - command line calculation

2 + 2
## [1] 4
20/5 - sqrt(25) + 3^2
## [1] 8
sin(pi/2)
## [1] 1

Note: The number in the square brackets is an indicator of the position in the output. In this case the output is a ‘vector’ of length 1 (i.e. a single number). More on vectors coming up…

Basic concepts in R - variables

x <- 10
x
## [1] 10
myNumber <- 25
myNumber
## [1] 25
sqrt(myNumber)
## [1] 5
x + myNumber
## [1] 35

Basic concepts in R - variables

x <- 21
x
## [1] 21
x <- myNumber
x
## [1] 25
myNumber <- myNumber + sqrt(16)
myNumber
## [1] 29

Basic concepts in R - functions

sin(x)

this returns the sine of x. In this case the function has one argument: x. Arguments are always contained in parentheses – curved brackets, () – separated by commas.

sum(3,4,5,6)
## [1] 18
max(3,4,5,6)
## [1] 6
min(3,4,5,6)
## [1] 3

Basic concepts in R - functions

seq(from = 2, to = 20, by = 4)
## [1]  2  6 10 14 18
seq(2, 20, 4)
## [1]  2  6 10 14 18

Basic concepts in R - vectors

x <- c(3,4,5,6)
x
## [1] 3 4 5 6
x[1]
## [1] 3
x[4]
## [1] 6
y <- c(2,3)
x[y]
## [1] 4 5

Basic concepts in R - vectors

x <- c(3,4,5,6,7,8,9,10,11,12)
x <- 3:12
x
##  [1]  3  4  5  6  7  8  9 10 11 12
x <- seq(2, 20, 4)
x
## [1]  2  6 10 14 18
x <- seq(2, 20, length.out=5)
x
## [1]  2.0  6.5 11.0 15.5 20.0

Basic concepts in R - vectors

y <- rep(3, 5)
y
## [1] 3 3 3 3 3
y <- rep(1:3, 5)
y
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3

Basic concepts in R - vectors

x <- 3:12
x[3:7]
## [1] 5 6 7 8 9
x[seq(2, 6, 2)]
## [1] 4 6 8
x[rep(3, 2)]
## [1] 5 5

Basic concepts in R - vectors

y <- c(x, 1)
y
##  [1]  3  4  5  6  7  8  9 10 11 12  1
z <- c(x, y)
z
##  [1]  3  4  5  6  7  8  9 10 11 12  3  4  5  6  7  8  9 10
## [19] 11 12  1

Basic concepts in R - vectors

x <- 3:12
x[-3]
## [1]  3  4  6  7  8  9 10 11 12
x[-(5:7)]
## [1]  3  4  5  6 10 11 12
x[-seq(2, 6, 2)]
## [1]  3  5  7  9 10 11 12

Basic concepts in R - vectors

x[6] <- 4
x
##  [1]  3  4  5  6  7  4  9 10 11 12
x[3:5] <- 1
x
##  [1]  3  4  1  1  1  4  9 10 11 12

Remember! Square brackets for indexing [], parentheses for function arguments ().

Basic concepts in R - vector arithmetic

x <- 1:10
y <- x*2
y
##  [1]  2  4  6  8 10 12 14 16 18 20
z <- x^2
z
##  [1]   1   4   9  16  25  36  49  64  81 100

Basic concepts in R - vector arithmetic

y + z
##  [1]   3   8  15  24  35  48  63  80  99 120
x + 1:2
##  [1]  2  4  4  6  6  8  8 10 10 12
x + 1:3
## Warning in x + 1:3: longer object length is not a
## multiple of shorter object length
##  [1]  2  4  6  5  7  9  8 10 12 11

Basic concepts in R - Character vectors and naming

gene.names <- c("Pax6", "Beta-actin", "FoxP2", "Hox9")
gene.expression <- c(0, 3.2, 1.2, -2)
gene.expression
## [1]  0.0  3.2  1.2 -2.0
names(gene.expression) <- gene.names
gene.expression
##       Pax6 Beta-actin      FoxP2       Hox9 
##        0.0        3.2        1.2       -2.0
names(gene.expression)
## [1] "Pax6"       "Beta-actin" "FoxP2"     
## [4] "Hox9"

Exercise: genes and genomes

Species Genome size (Mb) Protein coding genes
Homo sapiens 3,102 20,774
Mus musculus 2,731 23,139
Drosophila melanogaster 169 13,937
Caenorhabditis elegans 100 20,532
Saccharomyces cerevisiae 12 6,692

Exercise: genes and genomes

Answers to genome exercise

genome.size  <- c(3102, 2731, 169, 100, 12)
coding.genes <- c(20774, 23139, 13937, 20532, 6692)
species.name <- c("H. sapiens","M. musculus",
                  "D. melanogaster","C. elegans",
                  "S. cerevisiae")
names(genome.size)  <- species.name
names(coding.genes) <- species.name
coding.bases <- coding.genes*0.0015
coding.bases
##      H. sapiens     M. musculus D. melanogaster 
##         31.1610         34.7085         20.9055 
##      C. elegans   S. cerevisiae 
##         30.7980         10.0380

Answers to genome exercise

coding.pc <- coding.bases/genome.size*100
coding.pc
##      H. sapiens     M. musculus D. melanogaster 
##        1.004545        1.270908       12.370118 
##      C. elegans   S. cerevisiae 
##       30.798000       83.650000
coding.bases[1]/coding.bases[5]
## H. sapiens 
##   3.104304
genome.size[1]/genome.size[5]
## H. sapiens 
##      258.5
names(coding.pc) <- NULL
coding.pc
## [1]  1.004545  1.270908 12.370118 30.798000
## [5] 83.650000

Writing scripts with RStudio

Typing lots of commands directly to R can be tedious. A better way is to write the commands to a file and then load it into R.

x <- 2 + 2
print(x)

Sourcing can also be performed manually with source("myScript.R")

Getting help

?seq
example(seq)
??plot

Interacting with the R console

R packages

source("http://bioconductor.org/biocLite.R")
biocLite("PackageName")

Exercise: Install packages ggplot2 and DESeq

Exercise: Install packages ggplot2 and DESeq

install.packages("ggplot2")
source("http://www.bioconductor.org/biocLite.R")
biocLite("DESeq")
library(ggplot2) # loads ggplot functions
library(DESeq)   # loads DESeq functions
library()        # Lists all the packages you've got installed 

2. Data structures

R is designed to handle experimental data

The patients data frame

First_Name Second_Name Full_Name Sex Age Weight Consent
1 Adam Jones Adam Jones Male 50 70.8 TRUE
2 Eve Parker Eve Parker Female 21 67.9 TRUE
3 John Evans John Evans Male 35 75.3 FALSE
4 Mary Davis Mary Davis Female 45 61.9 TRUE
5 Peter Baker Peter Baker Male 28 72.4 FALSE
6 Paul Daniels Paul Daniels Male 31 69.9 FALSE
7 Joanna Edwards Joanna Edwards Female 42 63.5 FALSE
8 Matthew Smith Matthew Smith Male 33 71.5 TRUE
9 David Roberts David Roberts Male 57 73.2 FALSE
10 Sally Wilson Sally Wilson Female 62 64.8 TRUE

Character, numeric and logical data types

age    <- c(50, 21, 35, 45, 28, 31, 42, 33, 57, 62)
weight <- c(70.8, 67.9, 75.3, 61.9, 72.4, 69.9, 63.5, 
71.5, 73.2, 64.8)
firstName <- c("Adam", "Eve", "John", "Mary", "Peter", 
"Paul", "Joanna", "Matthew", "David", "Sally")
secondName <- c("Jones", "Parker", "Evans", "Davis", 
"Baker","Daniels", "Edwards", "Smith", "Roberts", "Wilson")
consent <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE,
FALSE, TRUE, FALSE, TRUE)

Character, numeric and logical data types

c(20, "a string", TRUE)
## [1] "20"       "a string" "TRUE"
 class(firstName)
## [1] "character"
 class(age)
## [1] "numeric"
 class(weight)
## [1] "numeric"
 class(consent)
## [1] "logical"

Factors

sex <- c("Male", "Female", "Male", "Female", "Male", "Male",
"Female", "Male", "Male", "Female")
sex
##  [1] "Male"   "Female" "Male"   "Female" "Male"   "Male"  
##  [7] "Female" "Male"   "Male"   "Female"
factor(sex)
##  [1] Male   Female Male   Female Male   Male   Female Male  
##  [9] Male   Female
## Levels: Female Male

Creating a data frame (first attempt)

patients <- data.frame(firstName, secondName, 
                       paste(firstName, secondName),  
                       sex, age, weight, consent)
patients
##    firstName secondName paste.firstName..secondName.    sex age weight consent
## 1       Adam      Jones                   Adam Jones   Male  50   70.8    TRUE
## 2        Eve     Parker                   Eve Parker Female  21   67.9    TRUE
## 3       John      Evans                   John Evans   Male  35   75.3   FALSE
## 4       Mary      Davis                   Mary Davis Female  45   61.9    TRUE
## 5      Peter      Baker                  Peter Baker   Male  28   72.4   FALSE
## 6       Paul    Daniels                 Paul Daniels   Male  31   69.9   FALSE
## 7     Joanna    Edwards               Joanna Edwards Female  42   63.5   FALSE
## 8    Matthew      Smith                Matthew Smith   Male  33   71.5    TRUE
## 9      David    Roberts                David Roberts   Male  57   73.2   FALSE
## 10     Sally     Wilson                 Sally Wilson Female  62   64.8    TRUE

Naming data frame variables

patients$age
##  [1] 50 21 35 45 28 31 42 33 57 62
names(patients) <- c("First_Name", "Second_Name", 
"Full_Name", "Sex", "Age", "Weight", "Consent")
names(patients)
## [1] "First_Name"  "Second_Name"
## [3] "Full_Name"   "Sex"        
## [5] "Age"         "Weight"     
## [7] "Consent"

Naming data frame variables

patients <- data.frame(First_Name = firstName, 
                       Second_Name = secondName, 
                       Full_Name = paste(firstName,secondName), 
                       Sex = sex,
                       Age = age,
                       Weight = weight, 
                       Consent = consent)
names(patients)
## [1] "First_Name"  "Second_Name"
## [3] "Full_Name"   "Sex"        
## [5] "Age"         "Weight"     
## [7] "Consent"

Factors in data frames

patients$First_Name
##  [1] Adam    Eve     John    Mary   
##  [5] Peter   Paul    Joanna  Matthew
##  [9] David   Sally  
## 10 Levels: Adam David Eve ... Sally

Factors in data frames

patients <- data.frame(First_Name = firstName, 
                       Second_Name = secondName, 
                       Full_Name = paste(firstName, secondName), 
                       Sex = factor(sex),
                       Age = age,
                       Weight = weight,
                       Consent = consent,
                       stringsAsFactors = FALSE)
 patients$Sex
##  [1] Male   Female Male   Female Male  
##  [6] Male   Female Male   Male   Female
## Levels: Female Male
 patients$First_Name
##  [1] "Adam"    "Eve"     "John"   
##  [4] "Mary"    "Peter"   "Paul"   
##  [7] "Joanna"  "Matthew" "David"  
## [10] "Sally"

Matrices

e <- matrix(1:10, nrow=5, ncol=2)
e
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
rowMeans(e)
## [1] 3.5 4.5 5.5 6.5 7.5

Indexing data frames and matrices

object[rows, colums]

e[1,2]
## [1] 6
e[1,]
## [1] 1 6
patients[1,2]
## [1] "Jones"
patients[1,]
##   First_Name Second_Name  Full_Name
## 1       Adam       Jones Adam Jones
##    Sex Age Weight Consent
## 1 Male  50   70.8    TRUE

Advanced indexing

 s <- letters[1:5]
 s[c(1,3)]
## [1] "a" "c"
 s[c(TRUE, FALSE, TRUE, FALSE, FALSE)]
## [1] "a" "c"

Advanced indexing

a <- 1:5
a < 3
## [1]  TRUE  TRUE FALSE FALSE FALSE
s[a < 3]
## [1] "a" "b"

Operators

s
## [1] "a" "b" "c" "d" "e"
a
## [1] 1 2 3 4 5
s[a > 1 & a <3]
## [1] "b"
s[a == 2]
## [1] "b"

Exercise

source("1.2_patients.R")

Logical indexing answers

patients[patients$Age < 40,]
patients[patients$Consent == TRUE,]
patients[patients$Sex == "Male" & patients$Weight >= 70.8,]

3. R for data analysis

3 steps to Basic Data Analysis

  1. Reading in data
    • read.table()
    • read.csv(), read.delim()
  2. Analysis
    • Manipulating & reshaping the data
    • Any maths you like
    • Plotting the outcome
  3. Writing out results
    • write.table()
    • write.csv()

A simple walkthrough

0. Locate the data

Before we even start the analysis, we need to be sure of where the data are located on our hard drive

getwd()
list.files()

1. Read in the data

rawData <- read.delim("1.3_NBcountData.txt")
myfile <- file.choose()
rawData <- read.delim(myfile)
read.csv("1.3_NBcountData.csv")
?read.table

1b. Check the data

Always check the object to make sure the contents and dimensions are as you expect

rawData[1:10,]  # View the first 10 rows to ensure import is OK
##    Patient Nuclei NB_Amp NB_Nor NB_Del
## 1        1     65      0     63      2
## 2        2     51      3     43      5
## 3        3     37      2     35      0
## 4        4     37      2     35      0
## 5        5     45      2     42      1
## 6        6     46      4     41      1
## 7        7     65      1     64      0
## 8        8     59      1     54      4
## 9        9     49      0     48      1
## 10      10     46      0     45      1
View(rawData)

Word of caution

tolstoy

hadley

Like families, tidy datasets are all alike but every messy dataset is messy in its own way - (Hadley Wickham)

You will make your life a lot easier if you keep your data tidy

http://vimeo.com/33727555

http://vita.had.co.nz/papers/tidy-data.pdf

and organised

http://kbroman.org/dataorg/

Handling missing values

x <- c(1,NA,3)
length(x)
## [1] 3

Handling missing values

mean(x, na.rm=TRUE)
## [1] 2
mean(na.omit(x))
## [1] 2

2. Analysis (reshaping data and maths)

# create an index of results:
prop <- rawData$NB_Amp / rawData$Nuclei 
# Get sample names of amplified patients:
amp <- which(prop > 0.33) 
plot(prop, ylim=c(0,1))
abline(h=0.33) # Add a horizonal line

3. Outputting the results

write.csv(rawData[amp,], file="selectedSamples.csv")
getwd() # print working directory
list.files() # list files in working directory

Data analysis exercise: Which samples are near normal?

Solution to NB normality test

norm <- which(prop < 0.33 & rawData$NB_Del == 0)
norm
##  [1]  3  4  7 15 20 24 36 37 42 47
write.csv(rawData[norm,], "My_NB_output.csv")

4. Plotting in R

Plot basics

Making a Scatter Plot

patients$Weight
##  [1] 70.8 67.9 75.3 61.9 72.4 69.9 63.5
##  [8] 71.5 73.2 64.8
plot(patients$Weight)

Making a Scatter Plot

R tries to guess the most appropriate way to visualise the data

Making a Scatter Plot of two variables

patients$Age
##  [1] 50 21 35 45 28 31 42 33 57 62
plot(patients$Weight,patients$Age)

Making a barplot

barplot(patients$Age)

Making a barplot

barplot(summary(patients$Sex))

Plotting a distribution: Histogram

hist(patients$Weight)

Plotting a distribution: Boxplot

boxplot(patients$Weight)

Plotting a distribution: Boxplot

boxplot(patients$Weight~patients$Sex)

Exercise: Data exploration

Suggestions

weather <- read.csv("ozone.csv")
View(weather)
Ozone Solar.R Wind Temp Month Day
41 190 7.4 67 5 1
36 118 8.0 72 5 2
12 149 12.6 74 5 3
18 313 11.5 62 5 4
NA NA 14.3 56 5 5
28 NA 14.9 66 5 6

Suggestions

plot(weather$Solar.R,weather$Ozone)

Suggestions

hist(weather$Temp)

Suggestions

boxplot(weather$Ozone)

Suggestions

boxplot(weather$Ozone~weather$Month)

Simple customisations

plot(patients$Weight,type="l")

Simple customisations

plot(patients$Weight,type="b")

Simple customisations

plot(patients$Weight,patients$Age,main="Relationship between Weight and Age")

Simple customisations

plot(patients$Weight,patients$Age,xlab="Weight")

Simple customisations

plot(patients$Weight,patients$Age,ylab="Age")

Simple customisations

plot(patients$Weight,patients$Age,
     ylab="Age",
     xlab="Weight",
     main="Relationship between Weight and Age",
     ylim=c(10,70),
     xlim=c(60,80))

Defining a colour

Use of colours

Changing the col argument to plot changes the colour that the points are plotted in

plot(patients$Weight,patients$Age,col="red")

Plotting characters

plot(patients$Weight,patients$Age, pch=16)

Plotting characters

Plotting characters

plot(patients$Weight,patients$Age, pch="X")

Size of points

Character expansion

plot(patients$Weight,patients$Age, cex=2)

Size of points

Character expansion

plot(patients$Weight,patients$Age, cex=0.2)

Colours and characters as vectors

plot(patients$Weight,patients$Age, 
     pch = 1:10,
     col=c("red","orange","green","blue"),
     cex = 1:5)

Other plots use the same arguments

boxplot(patients$Weight~patients$Sex,
        xlab="Sex",
        ylab="Weight",
        main="Relationship between Weight and Gender",
        col=c("blue","yellow"))

Exercise

Solutions

plot(weather$Solar.R,weather$Ozone,col="orange",pch=16,
     ylab="Ozone level",xlab="Solar Radiation", 
     main="Relationship between ozone level and solar radiation")

Solutions

hist(weather$Temp,col="purple",xlab="Temperature",
     main="Distribution of Temperature",breaks = 50:100,freq=FALSE)

Solutions

boxplot(weather$Ozone~weather$Month,col=rainbow(5))

Solutions

library(RColorBrewer)
display.brewer.all()

Solutions

boxplot(weather$Ozone~weather$Month,col=brewer.pal(5,"Set1"))

And finally..

library(palettetown)
boxplot(weather$Ozone~weather$Month,col=pokepal("Pikachu",5))

End of Day 1

To come tomorrow……